makeHopeLive

Intruduction

This time, we will use Prosper loan data to do the following basic analysis, what we want to find is that what factors will impact the APR (Annuall Percentage Rate) and build one prediction model. Meanwhile, we want to let Borrower know how they can reduce their BorrowerAPR.

Prepare Data

Show the variables’ basic meaning

Variable Description
ListingKey Unique key for each listing, same value as the ‘key’ used in the listing object in the API.
ListingNumber The number that uniquely identifies the listing to the public as displayed on the website.
ListingCreationDate The date the listing was created.
CreditGrade The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.
Term The length of the loan expressed in months.
LoanStatus The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
ClosedDate Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses.
BorrowerAPR The Borrower’s Annual Percentage Rate (APR) for the loan.
BorrowerRate The Borrower’s interest rate for this loan.
LenderYield The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee.
EstimatedEffectiveYield Effective yield is equal to the borrower interest rate (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. Applicable for loans originated after July 2009.
EstimatedLoss Estimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009.
EstimatedReturn The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009.
ProsperRating (numeric) The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.
ProsperRating (Alpha) The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.
ProsperScore A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.
ListingCategory The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
BorrowerState The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
Occupation The Occupation selected by the Borrower at the time they created the listing.
EmploymentStatus The employment status of the borrower at the time they posted the listing.
EmploymentStatusDuration The length in months of the employment status at the time the listing was created.
IsBorrowerHomeowner A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
CurrentlyInGroup Specifies whether or not the Borrower was in a group at the time the listing was created.
GroupKey The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation.
DateCreditPulled The date the credit profile was pulled.
CreditScoreRangeLower The lower value representing the range of the borrower’s credit score as provided by a consumer credit rating agency.
CreditScoreRangeUpper The upper value representing the range of the borrower’s credit score as provided by a consumer credit rating agency.
FirstRecordedCreditLine The date the first credit line was opened.
CurrentCreditLines Number of current credit lines at the time the credit profile was pulled.
OpenCreditLines Number of open credit lines at the time the credit profile was pulled.
TotalCreditLinespast7years Number of credit lines in the past seven years at the time the credit profile was pulled.
OpenRevolvingAccounts Number of open revolving accounts at the time the credit profile was pulled.
OpenRevolvingMonthlyPayment Monthly payment on revolving accounts at the time the credit profile was pulled.
InquiriesLast6Months Number of inquiries in the past six months at the time the credit profile was pulled.
TotalInquiries Total number of inquiries at the time the credit profile was pulled.
CurrentDelinquencies Number of accounts delinquent at the time the credit profile was pulled.
AmountDelinquent Dollars delinquent at the time the credit profile was pulled.
DelinquenciesLast7Years Number of delinquencies in the past 7 years at the time the credit profile was pulled.
PublicRecordsLast10Years Number of public records in the past 10 years at the time the credit profile was pulled.
PublicRecordsLast12Months Number of public records in the past 12 months at the time the credit profile was pulled.
RevolvingCreditBalance Dollars of revolving credit at the time the credit profile was pulled.
BankcardUtilization The percentage of available revolving credit that is utilized at the time the credit profile was pulled.
AvailableBankcardCredit The total available credit via bank card at the time the credit profile was pulled.
TotalTrades Number of trade lines ever opened at the time the credit profile was pulled.
TradesNeverDelinquent Number of trades that have never been delinquent at the time the credit profile was pulled.
TradesOpenedLast6Months Number of trades opened in the last 6 months at the time the credit profile was pulled.
DebtToIncomeRatio The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
IncomeRange The income range of the borrower at the time the listing was created.
IncomeVerifiable The borrower indicated they have the required documentation to support their income.
StatedMonthlyIncome The monthly income the borrower stated at the time the listing was created.
LoanKey Unique key for each loan. This is the same key that is used in the API.
TotalProsperLoans Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans.
TotalProsperPaymentsBilled Number of on time payments the borrower made on Prosper loans at the time they created this listing. This value will be null if the borrower had no prior loans.
OnTimeProsperPayments Number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans.
ProsperPaymentsLessThanOneMonthLate Number of payments the borrower made on Prosper loans that were less than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
ProsperPaymentsOneMonthPlusLate Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
ProsperPrincipalBorrowed Total principal borrowed on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
ProsperPrincipalOutstanding Principal outstanding on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
ScorexChangeAtTimeOfListing Borrower’s credit score change at the time the credit profile was pulled. This will be the change relative to the borrower’s last Prosper loan. This value will be null if the borrower had no prior loans.
LoanCurrentDaysDelinquent The number of days delinquent.
LoanFirstDefaultedCycleNumber The cycle the loan was charged off. If the loan has not charged off the value will be null.
LoanMonthsSinceOrigination Number of months since the loan originated.
LoanNumber Unique numeric value associated with the loan.
LoanOriginalAmount The origination amount of the loan.
LoanOriginationDate The date the loan was originated.
LoanOriginationQuarter The quarter in which the loan was originated.
MemberKey The unique key that is associated with the borrower. This is the same identifier that is used in the API member object.
MonthlyLoanPayment The scheduled monthly loan payment.
LP_CustomerPayments Pre charge-off cumulative gross payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries.
LP_CustomerPrincipalPayments Pre charge-off cumulative principal payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries.
LP_InterestandFees Pre charge-off cumulative interest and fees paid by the borrower. If the loan has charged off, this value will exclude any recoveries.
LP_ServiceFees Cumulative service fees paid by the investors who have invested in the loan.
LP_CollectionFees Cumulative collection fees paid by the investors who have invested in the loan.
LP_GrossPrincipalLoss The gross charged off amount of the loan.
LP_NetPrincipalLoss The principal that remains uncollected after any recoveries.
LP_NonPrincipalRecoverypayments The interest and fee component of any recovery payments. The current payment policy applies payments in the following order: Fees, interest, principal.
PercentFunded Percent the listing was funded.
Recommendations Number of recommendations the borrower had at the time the listing was created.
InvestmentFromFriendsCount Number of friends that made an investment in the loan.
InvestmentFromFriendsAmount Dollar amount of investments that were made by friends.
Investors The number of investors that funded the loan.

Show the structure of the data set.

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Wow, 81 variables, since not familar with loan data features, need to understand them with the following exploration and web searching. 113937 obs, not small.

After checking the summary of the data set, why so many NA values? Part of them have the same NA count, what causes the NA value? Confused. The information should be calculated automatically, e.g., EstimatedEffectiveYield.

Why duplicate ListingKey? Subset the duplicate ListingKey data for example.

ListingKey ProsperScore
13079 17A93590655669644DB4C06 4
14889 17A93590655669644DB4C06 8
20570 17A93590655669644DB4C06 7
31451 17A93590655669644DB4C06 10
42751 17A93590655669644DB4C06 5
42752 17A93590655669644DB4C06 6

The only difference is ProsperScore, how will cause the ProsperScore to change? not understood. So, for each loan data, if prosperScore changes, will be saved several times?

Univariate Plots Section

The main objective for this article is to find what factors will impact the BorrowerAPR, so we want to know the BorrowerAPR distribution firstly.

BorrowerAPR

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25

The most frequent BorrowerAPR should still be around 0.2. Another peak is around 0.36

Then we want to explore each variable one by one

CreditGrade

##          AA     A     B     C     D     E  NA's 
## 85125  3509  3315  4389  5649  5153  3289  3508

What’s the meaning of ‘NC’? not correct? remove ‘NC’. C Credit Grade is with big probabiltiy, lots of loan records have no CreditGrade information, this is reasonable, because CreditGrade is used for assessing the loan before 2009 July. After 2009 July, we will use ProsperRating for each loan. As we know, CreditGrade or ProsperRating should be one import factor that impacts the APR. There’s no “HR” level in CreditGrade, is ‘NA’ ‘HR’ level?

Something wrong here, NA value should be ‘HR’, change NA to HR.

##          AA     A     B     C     D     E    HR 
## 85125  3509  3315  4389  5649  5153  3289  3508

Term

##    12    36    60 
##  1614 87778 24545

Doubt that may Term 12 has been canceled in the latest prosper loan, however, from the data, the creation time is not old, so this thought is wrong.

There are just three values for Term variable, the most frequent one is 36, three years.

LoanStatus

##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

This LoanStatus is not the feature we care about for BorrowerAPR prediction, however, this one may can be used for predicting what kind of loan will be charged-off. This status WOW me, the probabiltiy for defaulted and charged-off is not small.

BorrowerRate

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

This feature is highly related with BorrowerAPR, BorrowerAPR = BorrowerRate + OrganizationFee. Will check whether organizationFee changes with Credit Grade or not, from the introduction is Prosper company, seems yes.

ProsperRating..numeric.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   3.000   4.000   4.072   5.000   7.000   29084

Numerice and alpha value describe the same thing, so can just keep one

ProsperRating..Alpha.

##          AA     A     B     C     D     E  NA's 
## 29084  5372 14551 15581 18345 14274  9795  6935

Something wrong here, NA value should be ‘HR’, change NA to HR.

##          AA     A     B     C     D     E    HR 
## 29084  5372 14551 15581 18345 14274  9795  6935

As we talked before, combine CreditGrade and ProsperRating two columns to one column CreditRating that can describe the credit value.

Create CreditRating

##           A    AA     B     C     D     E    HR 
##   272 17866  8881 19970 23994 19427 13084 10443

Still have 131 loans that are with no CreditRating information.

##          AA     A     B     C     D     E    HR 
##   272  8881 17866 19970 23994 19427 13084 10443

All the credit information is combined. best -> worst, ‘AA’ -> ‘HR’.

ProsperScore

##     1     2     3     4     5     6     7     8     9    10    11  NA's 
##   992  5766  7642 12595  9813 12278 10597 12053  6911  4750  1456 29084

why there’s 11? in data decription file, 10 should be the highest value. What ever, best -> worst, 11 -> 1.

ListingCategory..numeric.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   2.774   3.000  20.000

The biggest probability is used for Debt consolication.

BorrowerState

Did not understand the meaning of the two letter abbreviation.

Occupation

The occupation should not be one key feature. Most borrowers choose the professional.

EmploymentStatus

##                    Employed     Full-time Not available  Not employed 
##          2255         67322         26355          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3806          1088           795          6134

The classes of EmploymentStatus are weired, employed, not employed, what’s the relationshipe between employed and Full-time, Part-time. Want to combine the levels to just two, employed and not employed.

Create EmploymentFlag

##     Employed Not employed 
##       113102          835

Most borrowers are employed.

EmploymentStatusDuration

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7625

This is one long tail variable.

IsBorrowerHomeowner

## False  True 
## 56459 57478

False and True are nearly 50% and 50%.

CreditScoreRangeLower and Upper should be combined into one range column, like income range.

Create CreditScoreRange

Most Borrowers credit score in range 640 - 740. The uppper value = lower value + 19, so we can just keep one for the next revision.

Try to build a new feature to reduce the CreditScoreRange level so check whether will improve the relationship

Create CreditScoreRevision

Add one variable to judge the length of credit history, the longer history, should the lower APR.

Create LengthHistory

The unit is day.

OpenRevolvingAccounts

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00    6.00    6.97    9.00   51.00

The most frequent count is around 5.

InquiriesLast6Months and TotalInquiries

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   1.000   1.435   2.000 105.000     697
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.000   4.000   5.584   7.000 379.000    1159

Most InquiriesLast6Months are smaller than 2, outliers can be up to around 100. Most TotalInquiries are smaller than 10, some of outliers can be up to around 300.

CurrentDelinquencies and AmountDelinquent

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.5921  0.0000 83.0000     697
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      0.0      0.0      0.0    984.5      0.0 463881.0     7622

75% CurrentDelinquencies is 0, most of CurrentDelinquence is smaller than 10, the outlier can be up to 80. The most frequent amountDelinquent not 0 is around 1000.

DelinquenciesLast7Years and TradesNeverDelinquent..percentage.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.155   3.000  99.000     990
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.820   0.940   0.886   1.000   1.000    7544

The variance of TradesNeverDelinquent..percentage. is not so small, except the 25% borrowers, the other ones all have ever missed the repayment time. However, for DelinquenciesLast7Years, 50% borrowers have 0 delinquence.

PublicRecordsLast12Months and PublicRecordsLast10Years

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.015   0.000  20.000    7604
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.3126  0.0000 38.0000     697

For this type of variable, actually, want to check, if ignore 0 value, will increase the relationshipe between it and BorrowerAPR?

RevolvingCreditBalance

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3121    8549   17599   19521 1435667    7604

The most frequent value is around 1e+04

BankcardUtilization and AvailableBankcardCredit

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.310   0.600   0.561   0.840   5.950    7604
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     880    4100   11210   13180  646285    7544

Most BankcardUtilization is smaller than 1, the most frequent AvailableBankcardCredit is around 10000.

DebtToIncomeRatio

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

The most frequent one is around 0.2.

IncomeRange

What’s the meaning of ‘not displayed’, what’s the difference between ‘not employed’ and ‘$0’?

IncomeVerifiable

##  False   True 
##   8669 105268

Most borrowers are IncomeVerifiable

Save the revision data

Univariate Analysis

What is the structure of your dataset?

There are 113937 obs. of 81 variables in this dataset. Each observation is one loan record. The dataset is collected from Prosper webBank, who is America’s frist peer-to-peer lending marketplace. Borrowers request personal loans and inverstor fund. Knowing more background will make us understand the data more easily and clearly.

What is/are the main feature(s) of interest in your dataset?

Since this is one loan dataset, we care most is the BorrowerAPR. We want to known what factors will impact the BorrowerAPR and build one prediction model. The basic one should be CreditRating, however, some combination of other variables should also be used to build the prediction model.

What other features in the dataset do you think will help support your

investigation into your feature(s) of interest?

In my view, the features may imapct the BorrowerAPR includes at least: ProsperScore, EmploymentStatus, EmploymentFlag, IsBorrowerHomeowener, CreditScoreRange, OpenRevolvingAccounts, InquireisLast6Months, CurrentDelinquencies, AmountDelinquent, DelinquenciesLast7Years, PublicRecordsLast10Years, PublicRecordsLast12Months, BankcardUtilization, AvailableBankcardCredit, TradesNeverDelinquent, DebtToIncomeRatio, IncomeRange, IncomeVerifiable, LengthHistory etc.

After lots of research, find that the five important components for credit score, i.e., payment history, credit utilization, length of credit history, new credit and credit mix. The features in data set almost all have relationship with the five components. However, except CreditRating, ProsperScore, CreditScoreRange, which already take acount of the whole history information, Delinquency part, PublicRecords part, BankcardUtilization, AvailableBankcardCredit, DebtToIncomeRatio, LengthHistory should be the most related features.

Did you create any new variables from existing variables in the dataset?

Since CreditGrade is used for period before July, 2009 and ProsperRating is used fo period after July, 2009, i combine them to one CreditRating feature.

Create a variable EmploymentFlag to classify the emopolyment status to employed and not employed, will check more in the Bivariate Plots Section.

Create a variable CreditScoreRange to describe CrediteScoreRangeLower and Upper more directly. And then create a variable CreditRevision to reduce the levels.

Create a variable LengthHistory to judge the length of credit history.

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

GreditGrade and ProsperRating miss ‘HR’ level, change the NA value to ‘HR’, since ProsperRating(numeric) use 1 for ‘HR’.

Change Term, ProsperScore to factor, since just limited levels.

Change FirstRecordedCreditLine and ListingCreationDate to Datetime

Bivariate Plots Section

Select the variables we most care about to get the subset to calculate the relationship between each other

Correlation matrix

Select the relationship: Very Strong (0.8 - 1), Strong (0.6 - 0.8), Moderate (0.4 - 0.6), Weak (0.2 - 0.4)

Var1 Var2 value
BorrowerRate BorrowerAPR 0.9898240
TotalTrades TotalCreditLinespast7years 0.9364824
CreditScoreRevision CreditScoreRange 0.9187899
CreditRating BorrowerRate 0.8725063
CreditRating BorrowerAPR 0.8697732
TotalInquiries InquiriesLast6Months 0.7419499
ProsperScore CreditRating -0.7052214
ProsperScore BorrowerAPR -0.6682872
ProsperScore BorrowerRate -0.6497361
CreditScoreRange CreditRating -0.6361950
CreditScoreRevision CreditRating -0.6178868
IncomeVerifiable DebtToIncomeRatio -0.6005166
CreditScoreRevision AvailableBankcardCredit 0.4718884
TradesNeverDelinquent..percentage. CreditScoreRange 0.4686689
AvailableBankcardCredit CreditScoreRange 0.4536941
BankcardUtilization CreditScoreRange -0.4053380
CreditScoreRevision BankcardUtilization -0.4193442
CreditScoreRange BorrowerAPR -0.4513276
TradesNeverDelinquent..percentage. CurrentDelinquencies -0.4587606
CreditScoreRevision BorrowerAPR -0.4789308
CreditScoreRange BorrowerRate -0.4834341
CreditScoreRevision BorrowerRate -0.4944032
TradesNeverDelinquent..percentage. DelinquenciesLast7Years -0.5164432
LengthHistory TotalTrades 0.3981152
CreditScoreRevision TradesNeverDelinquent..percentage. 0.3969301
AvailableBankcardCredit CreditRating -0.3880894
DelinquenciesLast7Years CurrentDelinquencies 0.3777769
CurrentDelinquencies CreditScoreRange -0.3738819
CreditScoreRange ProsperScore 0.3696030
LengthHistory TotalCreditLinespast7years 0.3678176
CreditScoreRevision ProsperScore 0.3587978
AvailableBankcardCredit BankcardUtilization -0.3508306
AvailableBankcardCredit BorrowerAPR -0.3489261
AvailableBankcardCredit BorrowerRate -0.3438611
AmountDelinquent CurrentDelinquencies 0.3405485
AvailableBankcardCredit ProsperScore 0.3185580
TotalTrades IsBorrowerHomeowner 0.3174058
TradesNeverDelinquent..percentage. CreditRating -0.3157106
CreditScoreRange IsBorrowerHomeowner 0.3023572
InquiriesLast6Months ProsperScore -0.2967619
BankcardUtilization CreditRating 0.2967502
CreditScoreRevision IsBorrowerHomeowner 0.2949463
TotalCreditLinespast7years IsBorrowerHomeowner 0.2935867
TotalInquiries CreditScoreRange -0.2932329
InquiriesLast6Months CreditScoreRange -0.2693860
BankcardUtilization BorrowerAPR 0.2614380
TradesNeverDelinquent..percentage. BorrowerRate -0.2611895
DelinquenciesLast7Years CreditScoreRange -0.2598307
BankcardUtilization BorrowerRate 0.2554820
IncomeVerifiable EmploymentFlag -0.2551694
TotalTrades AvailableBankcardCredit 0.2499171
BankcardUtilization ProsperScore -0.2446956
CurrentDelinquencies CreditRating 0.2441267
TradesNeverDelinquent..percentage. BorrowerAPR -0.2413489
TradesNeverDelinquent..percentage. AvailableBankcardCredit 0.2384296
CreditScoreRevision DelinquenciesLast7Years -0.2364492
CreditScoreRevision CurrentDelinquencies -0.2345581
DelinquenciesLast7Years AmountDelinquent 0.2332703
LengthHistory CreditScoreRange 0.2260040
CreditScoreRevision TotalInquiries -0.2253238
TotalInquiries ProsperScore -0.2157662
InquiriesLast6Months CreditRating 0.2155871
IncomeRange EmploymentFlag 0.2123003
LengthHistory IsBorrowerHomeowner 0.2007115

From the correlation figure, we can see that there’s very strong relationship between variable BorrowerAPR and CreditRating, this meets our expectation, the higher CreditRating is, the lower BorrowerAPR should be.

Moreover, there’s strong relationship between variable BorrowerAPR and ProsperScore while meanwhile CreditRating also has strong relationship with ProsperScore.

BorrowerAPR has moderate relationship with CreditScoreRevision while CreditRating also has strong relationship with CreditScoreRevision.

BorrowerAPR has weak relationship with BankcardUtilization, AvaliableBankcardCredit and TradesNeverDelinquent. Meanwhile, they have weak relationship between each other. Moreover, they have moderate relationship with CreditScoreRange.

InquiriesLast6Months has strong relationship with TotalInquiries, reasonable.

Intesting, IncomeVefiable has strong relationship with DebtToIncomeRatio, why?

TradesNeverDelinquent has moderate relationship with CurrentDelinquencies, DelinquenciesLast7Years, which makes sense.

lots of weak relationship.

The created variables’ value is not that obvious. CreditScoreRevision imporves a little compared CreditScoreRange. Will use CreditScoreRevision for the following analysis.

have not found the variables that are related to BorrowerAPR while not related to CreditRating

Orgination fee of CreditRating

One idea bingo, want to know whether the orgination fee changes with CreditRating, so we will build one variable OrginationFee and visulize it.

We can see that the orgination fee also changes with the CreditRationg level. The obvious diff is between level ‘AA’ and level “A”.

BorrowerAPR of CreditRating

The result meets our expectation as mentioned before, However, why there are lots of outliers and the variance is not small? it seems there are still other variables control the BorrowerAPR trendency, their influence can not be ignored.

Keep CreditRating fixed, chech the influence of ProsperScore.

BorrowerAPR of ProsperScore

As the correlation value calculated before, strong relationship between BorrowerAPR and ProsperScore.

Plot the relationship between ProsperScore and CreditRating. Confused, how to get the CreditRating? how to calculate the ProsperScore?

CreditRating vs ProsperScore

From the figure, we can see, part of ‘AA’ borrowers still have high risk score, e.g., 4. Therefore, we still should keep the ProsperScore feature, it can descirbe a different dimension for BorrowerAPR prediction.

BorrowerAPR of CreditScoreRange

The sample number in both tail side is not enough, e.g., [0 - 440]. The score is got from customer credit rating agency, still that quesion, how to get credit rating?

CreditRating vs CreditScoreRange

From the figure, we can see CreditRating ‘AA’ borrowers may have a low CreditScore, why? Whatever, CreditScoreRange is still one important feature for the prediction.

Try to reduce the levels for CreditScoreRange to check whether can imporve the correlation value by this way.

BorrowerAPR of CreditScoreRevision

##   (0,640] (640,680] (680,720] (720,760] (760,800] (800,840] (840,880] 
##     26605     32858     28394     15873      7268      1976       239 
##      NA's 
##       724

Emmm, more clear than CreditScoreRange.

BorrowerAPR of BankcardUtilization

If understand correctly, this BankcardUtilization should mean ratio of your credit card balances to credit limits. The higher the BankcardUtilization, the higher BorrowerAPR, because high BankcardUtilization will make lender to think that there’s an increased risk.

BorrowerAPR of BankcardUtilization

Bases on before correlation calculation output, we know AvailableBankcardCredit should have weak relationship with BorrowerAPR, this figure shows this. With present knowledge, credit limits = credit balance + credit pending transaction + avaliable credit. This feature can show the credit limits inderectly.

BorrowerAPR of TradesNeverDelinquent..percentage.

The higher radesNeverDelinquent..percentage., the lower BorrowerAPR.

BorrowerAPR of EmploymentStatus and EmploymentFlag

There’s no clear trend in EmploymentStatus figure, however, we can see the BorrowerAPR of ‘Not employed’ is obviously larger than other classes. So, create a new variable EmploymentFlag to check more.

This should be clear now, Employed may get a low Borrower APR, but why the correlation value is low between BorrowerAPR and EmployedFlag? If we check more carefully, the tail below 25% seems long.

BorrowerAPR of CurrentDelinquencies and AmountDelinquent

The Trend is not that obvious, just a little move upward. However, for CurrentDelinquencies, the diff betweet 0 and 1 is much bigger than the diff between 1 and 2, 2 and 3.

BorrowerAPR of DelinquenciesLast7Years

The trend is a little more obvious than CurrentDelinquencies and AmountDelinquent.

BorrowerAPR of IsBorrowerHomeowner

The trend is not that obvious.

BorrowerAPR of InquiriesLast6Months

The Trend is not that obvious, just a little move upward.

BorrowerAPR of TotalTrades

The Trend is not that obvious.

BorrowerAPR of DebtToIncomeRatio

As what

BorrowerAPR of OpenRevolvingAccounts

BorrowerAPR of IncomeRange

0 not reasonable, samples are not enough. What’s the difference between ‘$0’ and ‘not employed’. The trend move a little downward.

BorrowerAPR of IncomeVerifiable

BorrowerAPR of PublicRecordsLast10Years

The same with CurrentDelinquencies, the diff is much larger between 0 and 1. For this kind of variable, acturally, we can build a new variable, 0 and not 0, to increase the relationship, am i right?

IncomeVerifiable vs DebtToIncomeRatio

50% DebtToincome is 10.1 when IncomeVerifiable is false, this is the max value for DebtToincome, what’s the meaning of 10.1? it seems that the strong relationship between DebtToIncomRatio and IncomeVerifiable is meanless and not useful for our objective.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the

investigation. How did the feature(s) of interest vary with other features in
the dataset?

There’s very strong relationship between variable BorrowerAPR and CreditRating, this meets our expectation, the higher CreditRating is, the lower BorrowerAPR should be.

Moreover, there’s strong relationship between variable BorrowerAPR and ProsperScore while meanwhile CreditRating also has strong relationship with ProsperScore. Moreover, from the bar plot, part of ‘AA’ borrowers still have high risk score, e.g., 4. Therefore, we still should keep the ProsperScore feature, it can descirbe a different dimension for BorrowerAPR prediction.

BorrowerAPR has moderate relationship with CreditScoreRange while CreditRating also has moderate relationship with CreditScoreRange. Moreover, we can see CreditRating ‘AA’ borrowers may have a low CreditScore, why? Whatever, CreditScoreRange is still one feature for the prediction.

BorrowerAPR has weak relationship with BankcardUtilization, AvaliableBankcardCredit and TradesNeverDelinquent. Meanwhile, they have weak relationship between each other. Moreover, they have moderate relationship with CreditScoreRange.

What confuses me is that, where we get the CreditRating, ProsperScore? All these features should combine the important credit information, like, payment history, credit utilization, length of creit history, new credit, credit mix etc, all of them are history feature.

Did you observe any interesting relationships between the other features

(not the main feature(s) of interest)?

InquiriesLast6Months has strong relationship with TotalInquiries, reasonable. TradesNeverDelinquent has moderate relationship with CurrentDelinquencies, DelinquenciesLast7Years, which makes sense. All these prove that the history can predict the present status.

IncomeVerifiable has strong relationship with DebtToIncomeRatio, then find that when IncomeVeriiable is False, 50% DebtToIncomeRatio is max value 10.1, do not know the meaning of this value, this relathionship should be not useful for our objective.

What was the strongest relationship you found?

BorrowerAPR and CreditRating

Multivariate Plots Section

BorrowerAPR vs CreditRating vs ProsperScore

The trend is not that obvious, we can still see the small downward with Prosper. However, seems not for D and E.

BorrowerAPR vs CreditRating vs CreditScoreRevision

BorrowerAPR vs CreditRating vs BankcardUtilization

We know there’s weak relationship between CreditRating and BankcardUtilization, the figure proves that, the most obvious one is that ‘AA’ borrowers tend to have small BankcardUtilization, ‘HR’ and ‘E’ borrowers tend to have big BankcardUtilization.

BorrowerAPR vs CreditRating vs TradesNeverDelinquent..percentage.

From the figure, we can see the weak relationship between TradesNeverDelinquent..percentage. and CreditRating, more ‘HR’ borrowers have small TradesNeverDelinquent..percentage. compared with ‘AA’ borrower. Moreover, we can see that if keep the creditRating fixed, smaller TradesNeverDelinquent..percentage tends to have bigger BorrowerAPR.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the

investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

BorrowerAPR has very strong relationship with CreditRating, strong relationship with ProsperScore, moderate relationship with CreditScoreRevision, weak relationship with BankcardUtilization, AvaliableBankcardCredit and TradesNeverDelinquent. Moreover, all these features have strong, moderate, weak relationship with CreditRating. However, if we dig deeper, can find each of them can describe a different dimension view of data. Where, wonder how to get the CreditRating value? calculated with history data, which includes payment history, credit utilization, length of creit history, new credit, credit mix etc? how to get the ProsperScore? how to get CreditScoreRange? what’s their difference?

Were there any interesting or surprising interactions between features?

lots of features have strong and moderate relationship with CreditRating. If all these features, i.e., CreditRating, ProsperScore, CreditScoreRevision, take account of history data, why they are different?, why we need them all?

OPTIONAL: Did you create any models with your dataset? Discuss the

strengths and limitations of your model.

Want to build a model to predict the BorrowerAPR, will do this in future, can use Machine Learning Algrithms.

Final Plots and Summary

Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25

Description One

From this figure, we can see several peaks, the most frequent one is around 0.2 and another bigger peak is around 0.36

Plot Two

Description Two

The higher CreditRating is, the lower BorrowerAPR will be. However, the variance for each CreditRating level is not small, There are still some other imported features to control the finnal BorrowerAPR. From the trendency, it seems that one linear model can be built to predict the BorrowerAPR.

Plot Three

Description Three

From the figure, we can see the weak relationship between TradesNeverDelinquent..percentage. and CreditRating, more ‘HR’ borrowers have small TradesNeverDelinquent..percentage. compared with ‘AA’ borrower. Moreover, we can see that if keep the creditRating fixed, smaller TradesNeverDelinquent..percentage tends to have bigger BorrowerAPR.


Reflection

This is one loan data set, it takes me huge time to understand each variable, which includes understand the variables based on excel data description, search the related knowledge of credit loan and Prosper WebBank etc.

Then i try to explore the data by Univariate Plot and Bivariate Plot, begin to know what i can find based on the data. BorrowerAPR, yes, that’s what borrower and lender most care about. For borrowers, they want to know how can reduce the Borrower APR; For lenders, they want to know, what kind of loan will help them make lots of money and minimize their loss. I find lots of information in Bivariate Plot part. When the correlation matrix is calculated, i compare the value with the plot and finally totally understand what’s going on here.

In order to predict the BorrowerAPR better, i create some new variables, e.g, CreditRating, CreditScoreRange, CreditScoreRevision, EmploymentFlag, LengthHistory. However, the influence is not that obvious, just the relationship between BorrowerAPR and CreditScoreRevision improves a little compared with the relationship between BorrowerAPR and CreditScoreRange. Actually, i also find some variables, e.g., CurrentDelinquencies, PublicRecordsLast10Years etc, the difference bettween 0 and 1 is much larger, in future, will reduce their levels to 0 and not 0 to check more details.

BorrowerAPR has very strong relationship with CreditRating, strong relationship with ProsperScore, moderate relationship with CreditScoreRevision, weak relationship with BankcardUtilization, AvaliableBankcardCredit and TradesNeverDelinquent. Where, CreditRating has strong relationship with ProsperScore, moderate relationship with CreditScoreRevision. However, ProsperScore and CreditScoreRevision can describe the different view with CreditRating, therefore, they should be both features in this prediction model. Moreover, CreditScoreRevision has moderate relationship with BankcardUtilization, AvaliableBankcardCredit and TradesNeverDelinquent. Have not found the features that are not related to CreditRating but have relationship with BorrowerAPR, which makes sense, since all the score data (i.e., CreditRating, ProsperScore, CreditScoreRevision) has already taken account of all the history data.

I am still confused, how to get the CreditRating value? Calculated with history data, which includes payment history, credit utilization, length of creit history, new credit, credit mix etc? how to get the ProsperScore? Why so many score data in this data set?

Lots of questions have not been answers, need more clues to answer these questions. In other words, still not clear for part of data in the dataset. This part information can not be got by exploring data, should contact the data collectors for more details.

In the future, will build the prediction model to predict the BorrowerAPR. Will split the dataset to training data and testing data, build one model using Machine Learning Algrithms.